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Eindhoven University of Technology, North Carolina State University 
and University of Amsterdam 

We propose a two-stage procedure for estimating the location fi 
and size M of the maximum of a smooth d-variate regression func- 
tion /. In the first stage, a preliminary estimator of fj, obtained from 
a standard nonparametric smoothing method is used. At the second 
stage, we "zoom-in" near the vicinity of the prehminary estimator 
and make further observations at some design points in that vicin- 
ity. We fit an appropriate polynomial regression model to estimate 
the location and size of the maximum. We establish that, under suit- 
able smoothness conditions and appropriate choice of the zooming, 
the second stage estimators have better convergence rates than the 
corresponding first stage estimators of fj, and M. More specifically, 
for a-smooth regression functions, the optimal nonparametric rates 
^-{a-i)/{2a+d) ^-a/(2Q,+d) ^j^g g^g^gg be impr oved to 

^-{a-i)/(2a) g^^^ n-^/^ respectively, for a > 1 -h y'l + d/2. These 
rates are optimal in the class of all possible sequential estimators. 
Interestingly, the two-stage procedure resolves "the curse of the di- 
mensionality" problem to some extent, as the dimension d does not 
control the second stage convergence rates, provided that the function 
class is sufficiently smooth. We consider a multi-stage generalization 
of our procedure that attains the optimal rate for any smoothness 
level Q > 2 starting with a preliminary estimator with any power-law 
rate at the first stage. 

1. Introduction. In many applications, it is of interest to estimate the 
location and size of the extremum of a univariate or multivariate regression 
function. For instance, an oil company may be interested in determining the 
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best location for drilling a well in a confined region. Based on information 
obtained from drilling at a few preliminary locations in the region, the goal 
is to obtain an estimate of the best location and the amount of the reserve 
based on these noisy measurements. 

Suppose we observe noisy measurements of an unknown regression func- 
tion / : M'^ — 7- M, sampled at points from some compact, convex set D C W^, 

(1) n = /(xfc)+Cfc, XfcGDcM^ A: = l,...,n, 

where the ^^'s are independent zero mean errors with Var(,^fc) = o"^. Clearly, 
for estimating any feature of /, the estimation error increases with o"^. Thus 
among all error distributions satisfying Var(^fc) < o"^ for every k, the ho- 
moscedasticity condition Var(^fc) = o"^ is the least favorable. This shows 
that the latter condition can be relaxed to the former without increasing the 
bound on error of estimation, and the obtained rates under the homoscedas- 
ticity condition remains minimax optimal under the larger heteroscedastic 
model. 

Assume that / has a unique maximum at /x in the interior of that is, 

(2) max/(x) = /(/x) = M, /(x) < /(/x) forallx//x. 

If the function / is sufficiently smooth, then the gradient V/(/x) = and 
the Hessian matrix of / at /i is nonpositive definite. The goal is to estimate 
the maximum of the regression function M = /(/x) and its location /x. 

Clearly, the choice of the design points {xfc, /c = 1, . . . ,n} significantly in- 
fiuences the estimation accuracy. There are two basic design settings: fixed 
in advance (or randomly sampled from a chosen distribution) and sequential, 
where one is allowed to use the information obtained from an earlier sample 
to determine subsequent design points. If the design is fixed and nothing 
is known about the location of the maximum, the design points should be 
"almost uniformly" spread out all over the set of interest D. The problem 
of estimating the location and size of extrema of nonparametric regression 
functions for the fixed design situation has been studied by many authors. 
The one-dimensional case is thoroughly investigated, whereas the study in 
the multivariate situation has been limited; see Miiller (1985, 1989), Shoung 
and Zhang (2001), Facer and Miiller (2003) and the references therein. The 
minimax rate for estimating the maximum value of the function ranging 
over an a-smooth nonparametric class (e.g., isotropic Holder class defined 
below) is 7),-"/{2a+ci) ^Y^Q estimation of the location of the maximum, 

it is a folklore that the minimax rate is the same as the minimax rate for 
estimating the first derivative of the regression function, which is given by 
j^-(o-i)/{2a+d) _ jj^ setting of estimating the mode /i of a univariate twice 
differentiable density /, Hasminskii (1979) showed that under the assump- 
tion that /"(/u) < 0, the lower bound for the minimax risk rate is of the order 
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n~^l^ consistent with the rate n"^""^)/^^'^"'''^^ Klemela (2005) considered the 
problem of adaptive estimation of the mode of a multivariate density with a 
bounded support that satisfies, in a neighborhood of the mode, a smoothness 
condition of a level higher than 2. 

If we can choose a design point before making each observation using the 
data obtained so far, then we are in the classical sequential design setting. 
Kiefer and Wolfowits (1952) introduced a Robbins-Monro type of algorithm 
to estimate the mode /i of / in the univariate framework. Blum (1954) pro- 
posed a multivariate version of their algorithm which allows to estimate the 
location of the maximum of a multivariate regression function /. Since 
then, this Kiefer-Wolfowits-Blum recursive algorithm has been extended 
in many directions by many authors. The main fact is that the algorithm 
converges to /x with the rate under the assumption that the regres- 

sion function / is three times differentiable. More generally, Chen (1988) 
and Polyak and Tsybakov (1990) established that, in the sequential de- 
sign setting, the minimax rate for estimating the location of the maximum 
of a-smooth regression functions is n~^"~^^/^^"^ . Dippon (2003) proposed 
a general class of randomized gradient recursive algorithms which attain 
the optimal convergence rate. Mokkadem and Pelletier (2007) considered 
the problem of simultaneously estimating, in the sequential design setting, 
the location and the size of the maximum of a regression function that is 
three times continuously differentiable. They proposed a companion recur- 
sive procedure to the Kiefer-Wolfowits-Blum algorithm so that, by applying 
both the companion and the Kiefer-Wolfowits-Blum algorithms, one can si- 
multaneously estimate the location and size of the maximum of regression 
functions in an on-line regime. Interestingly, in a sequential design setting, 
the convergence rate for estimating the maximum itself M = /(/x) can, in 
principle, attain the parametric rate n~^/^. The companion procedure of 
Mokkadem and Pelletier (2007) for estimating the maximum can also achieve 
the parametric rate n~^/^, but this companion procedure must use different 
design points than those used in the Kiefer-Wolfowits-Blum procedure. 

In this paper, we propose a two-stage strategy to tackle the problem of si- 
multaneously estimating the location fi and size M of the maximum of the 
regression function / according to the observation scheme (1). This is an 
approach in between the two above described frameworks — global fixed de- 
sign and a fully sequential design. Often, from an operational point of view, 
fully sequential sampling can be expensive, whereas a two-stage procedure is 
much simpler to implement. Our findings establish that the two-stage pro- 
cedure can be properly designed to match the strength of a fully sequential 
procedure. Moreover, the same design scheme can be used to obtain the 
optimal rates for estimating both /x and M. 

Now we describe the two-stage procedure. We construct a preliminary 
estimator /i of /i by spending a portion of our sampling budget to make ob- 
servations over a relatively uniform grid of points in the area of interest and 
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applying some standard nonparametric smoothing method for the fixed de- 
sign setting based on this initial set of data. Additional prior information, if 
available, may also be used to reduce the span of the design points or to more 
efficiently choose design points leading to increased accuracy of the prelimi- 
nary estimator. At the second stage, we "zoom-in" on a neighborhood of fi 
of an appropriate size 5n, to be called the localization parameter. The idea is 
that if this vicinity is "small enough," that is, the preliminary estimator fi 
converges to /z, the regression function / can be accurately approximated by 
a Taylor polynomial within the vicinity of ji. We then spend the remaining 
portion of the sampling budget to gather further observations at appropri- 
ately chosen design points in the vicinity of fx. Finally, we fit a polynomial 
regression model on the new set of data and show that the remainder of the 
expansion is appropriately small, provided that the preliminary estimator 
jl has sufficient accuracy. This procedure leads to improved estimators of /x 
and M and does not use knowledge of the noise variance cr^ . The last step 
in our approach is reminiscent of the nonparametric methodology of local 
polynomial regression in case of fixed design setting; see Fan and Gijbels 
(1996). Our two-stage procedure is motivated by the recent work of Lan, 
Banerjee and Michailidis (2009) and Tang, Banerjee and Michailidis (2011), 
who, respectively, considered such procedures for estimating change points 
in a regression function and the level point of a univariate monotone re- 
gression function. Motivating grounds for a two-stage approach were nicely 
described by them. The principal differences between their and our tech- 
niques are that we consider smooth rather than step or monotone functions, 
and we use polynomial regression of an appropriate degree in the second 
stage rather than regression based on step or linear functions, respectively, 
used by them. 

The results for estimating fi and M under the fully sequential setting, 
which we are aware of, all follow the Robbins-Monro procedure, where the 
next design point depends only on the previous observation and does not in- 
corporate all available information up to the current moment. In this setting, 
one makes observations only along a certain path of design points, eventually 
leading to the location of the maximum. In our two-stage approach, one also 
gets the global estimate of the regression function from the first stage all 
over the area of interest, which may be useful in some practical situations. 
We also get an accompanying estimator for the size of the maximum M (in 
fact, for all the relevant derivatives at the location of the maximum) in a 
natural way, while in a Robbins-Monro type sequential design, one needs to 
adjust the design points to estimate M. This can place serious constraints 
on the available budget since typically both and M need to be estimated. 

Our main result gives a decomposition of the convergence rate of the 
second stage estimator as the sum of an approximation term and a stochas- 
tic term, similar to the classical bias-variance trade-off. An implication of 
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the main result is as follows. Suppose we take a preliminary nonparametric 
estimator p, with the optimal single-stage convergence rate n~^'^~^^^^'^"^^\ 
Then by applying our two-stage procedure with an appropriate choice of 
the localization parameter 6n, we obtain optimal (for the sequential design 
setting) convergence rates, n~('^~^)/(^°) and n~^/^, respectively, under the 
condition on the smoothness parameter a> 1 + y^l + d/2. Note that n~^/^ 
is also the "oracle rate" for estimating M corresponding to taking n sam- 
ples at the "perfect location" fi. Thus, for a-smooth regression functions, 
the second stage improves the rates in estimating fi and M from the non- 
parametric rates n~^°'~^^^^'^°'~^'^^ and 72-"/(2"+rf) optimal sequential 
rates n"*^""^^/*^^") and n"^^"^, respectively. Curiously, the dimension d disap- 
pears from powers in the second stage convergence rates. However, the curse 
of dimensionality is still present in a milder form through the constraint 
a > 1 + + d/2. For instance, if a > 3, then the second stage rates are op- 
timal for d= 1, . . . , 6. We can resolve the curse of dimensionality completely 
by considering a multi-stage generalization of the two-stage procedure, ob- 
tained by iterating the second stage operation on the estimator obtained in 
the second stage, and continuing the iteration sufficiently many times. We 
shall show that after an appropriate number of stages, the optimal conver- 
gence rates are attained for any a > 2. In fact, even if we start with a not 
necessarily optimal preliminary estimator at the first stage (as long as it has 
a convergence rate of a power-law type) , this multi-stage approach will lead 
to the optimal resulting stage after a finite number of stages. The number of 
stages depends on the smoothness of the regression function and the quality 
(convergence rate) of the preliminary estimator from the first stage. The 
method still uses knowledge of the smoothness level a in its formulation, 
and hence is not adaptive for estimating fi. Nevertheless, the multi-stage 
procedure achieves the optimal rate n~^/^ for estimating M without using 
the knowledge of a. 

The paper is organized as follows. In Section 2, we introduce the notation 
and assumptions. Section 3 describes the two-stage procedure and states 
the main result. The multi-stage generalization is discussed in Section 4, 
and some simulation results are given in Section 5. Proofs are presented in 
Section 6. Some auxiliary results are given in the Appendix. 

2. Notation, preliminaries and assumptions. We describe the notation 
and conventions to be used in this paper. All asymptotic relations and 
symbols [like 0{6n), o{Sn), Op{6n), Op(5„) etc.] will refer to the asymptotic 
regime n — )• oo; here Cn = 0{6n) [resp., c„ = o{5n)] means that that c„/5„ is 
bounded (resp., c„/(5„ — )• 0) and for a stochastic sequence Xn, Xn = Op{5n) 
[resp., Xn = Op{5n)] means that that PjlXnl < Kcn} — ?• 1 for some con- 
stant K (resp., P{|X„| < e5n} — 1 for all e > 0). For numerical sequences 
/3n and P'n, by /?„ < /3'^ (or /3; > /? n) we mean that fSn — o(/3(^), while by 
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> /3„, we mean that /3„ = 0{/3'^). By /3„ x f3'^ we mean that = 0{j3'^) 
and 13'^ = 0{/3n). Let N stand for {0, 1,2,.. .}. For a set S, denote by jS"] the 
number of elements in S. Vectors are represented by bold symbols and can 
be upper or lowercase English or Greek letters. All vectors are in the column 
format with the corresponding nonbold letters with subscripts denoting the 
components, that is, for x,Xfc G M*^, x = (xi, . . . ,Xd) and x^ = (x^^i, . . . ,2;^^^). 
By ||x|| for a vector x, we mean the usual Euclidean norm of X G M''. Ma- 
trices are also written in bold and only uppercase English letters are used 
to denote them. If A is a matrix, || A|| will stand for a norm on the space of 
matrices such as the operator norm defined by ||A|| =sup||x||<i ||Ax||. Let 
B{c, R) = {z gR'^ -.{{z - c\\ < R} denote a bah in R'^ with center c G M"* and 
radius R > 0. Define a cube around a point a = (ai, . . . , a^) G with an 
edge length 26 by 

(3) C7(a,(5) = {xGM'^:xfcG [afc-(5,afc + (5],A; = l,...,(i}cM'^. 

If a = 0, then we write C{5) for C(0, 5). 

We shall use the multi-index notation i = (ii, . . . , i^) G N'^. For a multi- 
index i, a vector x G M'^ and a sufficiently smooth function f of d variables, 
define 



d 



= Y,ik, il = l[ikl, x' = n4N D7(xo) 

k=l k=l 

For k,d,r £ N, define 



X 



fc=l k=l k=l '^^l ' ' ' ^^d 



hid) = {i£N'':h + --- + id = k}, I{r,d) = \Jlkid), 

k=0 

with Io('^) = {(0, • • • ,0)}. For convenience in writing, I(r, d) will be enumer- 
ated by stacking elements oflo{d),li{d), . . . ,Ir{d), in that order. Within each 
Ik{d), the elements are arranged following the lexicographic (or dictionary) 
ordering. Observe that Ik{d) and Ii{d) introduced above are disjoint if fc / L 
The cardinality jlfcl*^)! is the number of d-tuples (fci, . . . , k^) G N*^ such that 
ki + ■ ■ ■ + kd = k, or equivalently, the number of ways to put k balls in d 
boxes. Thus \Ikid)\ = ('^+^"^) = C^^^^^^), and hence 



\Iir,d)\=^Md)\=Yl 

k=0 k=0 



d + k-l 
d-l 



In particular, |I(r, 1)| = r + 1. 

For an a G M, let [a] be the smallest integer bigger than or equal to a. 
Then = [a — 1] stands for the largest integer which is strictly less than 
a. Clearly, if a G N, then = a — 1. 
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For Q,L > and a compact, convex set D C W^, introduce an isotropic 
Holder functional class Tiaict, L, D), consisting of ra-times differentiable 
functions / : Z) — )• M such that 

(4) |/(x)-Pj,,„(x)| <L||x-xo|r, x,xoeA 
where 

(5) ^'/,xo(x)= ^D7(xo)(x-xo)' 

iel(rc,rf) 

is the Taylor polynomial of order obtained by expansion of / about the 
point xq. 

Put q{a,d) = \I{ra,d)\ — 1. Observe that the total number of terms in the 
d-variate Taylor polynomial Py^xo(x) of order defined in (5) is q{a,d) + 1. 

For a function (7 : M'^ — t- M such that all second-order partial derivatives of g 
exist at a point xq G W^, denote by Hg{x.Q) the Hessian matrix of the function 

g at the point xq, whose (i,j)th entry is given by g/.^g°^ , i,j = l,...,d. 
Notice that if g has continuous second order partial derivatives at xq, then 
the Hessian matrix Hg{xo) is symmetric, and hence its eigenvalues must 
be real. For a symmetric matrix M, denote by Aniin(M) and Amax(M) the 
smallest and the largest eigenvalues of M, respectively. 

Consider the model (1) with / : D — )• M. We now describe the assumptions 
on / to be used throughout the paper. 

(Al) The function /(x), x G D C M'^, allows extension on a slightly bigger 
set = U^g£, i?(x, e) for some e > (in order to avoid boundary effects) 
and belongs to an isotropic Holder functional class T-L^[a,L,D^) defined by 
(4), with L > and a > 2. 

o 

(A2) There is a unique point /i in the interior D of D that maximizes 
the function / on D, that is, M = sup^g/j /(x) = max ° /(x) = /(/^) and 

Amax(^/(/x))<0. 

Note that conditions (Al) and (A2) imply that V/(/i) = 0, and the Hes- 
sian Hf{fi) is a symmetric and negative definite matrix. Besides, as a > 2, 
the Hessian matrix Hf{x) is continuous and therefore for some k,Xq > 0, 

(6) sup Amax(-H'/(x)) < -Aq. 

xeB(/x,K) 

Notice that constants k, Aq depend on /. If we do not pursue any uniformity 
over / in our results, then the condition (6) follows from (Al), (A2) and can 
therefore be used in the proofs. However, when uniformizing the results over 
a functional class, this condition becomes autonomous and must be added 
to the description of the functional class; see Remark 1 below. 
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3. The two-stage procedure. For a column vector i? = (^^i : i G I{ra,d))'^ , 
introduce the multivariate polynomial function 

(7) /^(x)= ^i^'=i; E ^i^'- 

ieI(rQ,d) A:=OiGlfc(d) 

We now describe the two-stage procedure for estimating the parameters 
(/Lt,M). The first stage consists of the first two steps and the steps 3-5 
comprise the second stage. 

(1) The first stage starts as follows. For v € (0, 1), choose first stage design 
budget, that is, ni G N such that < ni < n, ni/n — v. Find rii design points 
{xj, i = 1, . . . , ni} approximately uniformly over the set D in the sense that, 
for some ci, C2 > 0, the family of balls {i?(xj, ci?i~^/'^), i = 1, . . . , ni} covers 
D and ||xj — Xj|| > C2n~^^'^ for i ^ j. 

Observe the data PJ' = {(xi.,yfc), A; = 1, ... ,ni}, Yk = f{^k)+ik, k = 
1 , . . . , ni , according to the model ( 1 ) . 

(2) Using Dl, construct a preliminary consistent estimator /i of /i. For 
d= 1, one may use the kernel estimator of Miiller (1989) and for d>2, its 
multivariate generalization given by Facer and Miiller (2003). 

(3) Let = n — ni be the remaining portion of the design budget, and let 
/ be the smallest integer that satisfies 21 >ra- Assume that n2 = n^i^l + 1)*^ 
for some G N, which is always possible to arrange. Note that ns > cn for 
some constant c > 0. Introduce a localization parameter 5n > 0, (5„ — t- 0, and 
define the set 

d 

Y[{l^k + jiSnJi = 0, ±1, ±2, . . . , ±1} = {di, . . . , d(2i+i)4> 

1=1 

which consists of {21 + l)'^ different points from the d-dimensional cube 

C7(A,Wn). 

Now introduce the second stage design points {x^. A; = 1, . . . , ^2} in such 
a way that \Ij\ = 123 for all j = 1, . . . , {21 + l)'^, where Ij = {1 < k < n2 ■ :x.k = 
dj}. In other words, each point among the {21 + 1)'^ different points from the 
set {di, . . . ,d(2/+i)d} is repeated = 77-2/(2/ + l)'^ times in the second stage 
design {x^, /c = 1, . . . , n2}- Observe the data = {(xfc, Ifc), A; = 1, . . . , 722}, 
Yfc = /(xfc) + ^fc, A; = 1, . . . , 77-2, according to the model (1). 

(4) Introduce the column vectors Y = (Yi, . . . , Yn2)'^ , x^ = (x^ : i G I{ra, d))'^ , 
A; = 1, ... ,712, and form the data-matrix X = (xi, . . . ,5tn2)'^ of dimension 
77,2 X {Q{ra, d) + 1). Now using I?2 5 a polynomial regression model of order 
ra by 

■d = argminy^(yfc — /^(x^))^ = argmin ||Y — Xi?|p, 
k=l 
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where the polynomial is introduced by (7). The unique least squares 
solution is given since X is full-rank by Lemma 1 

below. Intuitively, this is expected since the number of observations n2 > 
{21 + 1)'=' > (r„ + 1)"^ > |I(r^, d)| = qia, d) + l. 

(5) Finally, define the two-stage estimator (/x, M) of (/x, M) by 

(8) A = arg max (x) , M = f^{fi). 

xGC{/j,,W„) 

Note that (/i, M) depends on the first-stage estimator and the localization 
parameter (5„ introduced in step 3. 

Clearly the construction of the two-stage procedure does not assume the 
knowledge of the error variance provided that the preliminary estimator 
jl also does not use this knowledge. Furthermore, the two-stage approach 
simultaneously estimates /x and M, since the same design points for both 
estimators are used in the procedure. The two-stage procedure also provides 
improved estimators for all the relevant derivatives of / at fi; see Remark 7 
below. 

The following theorem gives the rate of convergence of the two-stage pro- 
cedure for any smoothness level a> 2. 

Theorem 1. Suppose that the localization parameter 6n satisfies \fnb\ — t- 
oo and ||/i — =0p((5„). Then under conditions (Al) and (A2), 

(9) II A - /^ll = Op(n-i/25-i) + Op(5r 1) 
and 

(10) M-M = 0p{n-'/^) + 0p{6'^). 

Condition ||/x — /x|| = Op(J„) has a clear heuristic interpretation: at the 
second stage, one should not localize more than what the accuracy of the 
estimation procedure allows at the first stage. Actually, it is sufficient to 
assume that P(||/i — /^|| < K5n) — )■ 1 for some K, but the dependence of K 
on unknown quantities will complicate the analysis. 

We first observe that there is always a rate improvement from the first 
stage to the second if (5„ is chosen properly. To see this, let e„ be the rate 
of convergence of p,. Since En cannot be better than the optimal rate of 
convergence of all possible sequential procedures, which is we 
have £n ^ n~^'^~^^^^'^"\ Choose 5„ = max(m„en, n.""^/'-^")), where m„ is a 
positive sequence going to infinity sufficiently slowly. Then ||/i — mII = Op{6n) 
and "v/ri'^n ~^ (as a>2) are satisfied, and hence it remains to show that 
5^-1 = 0(e„) and n-^a^-i = 0(e„). If e„ < n-V(2«), then Sn = n-^/^^") 
and the second stage rate is n~^^'^6~^ = S^~^ = n~("~^)/(^°) = 0(e„), and 
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the order improves strictly unless x 72~(<^~i)/(2") _ Clearly, in this case, the 
choice of 5„ is optimal as it balances the "order of variability" n~^^'^5~^ and 
the "order of bias" On the other hand, if e„ > n~^/(^'^\ 5n = mnEn, 

so n~^/^(5~^ = o((5^~^) and the second stage rate is 6"~^ = m"~^e"~-^ <C 
En, since a > 2 and nin grows sufficiently slowly. Note that the "optimal 
choice" 6n = prohibited in this case since we need e„ = o((5„). For 

estimating M, the rate of convergence of the two-stage procedure clearly is 
max(n~^/^, m"e"), which matches the optimal rate n~^/^ if e„ ^ 
Of course, if the choice of 5n is too big, then the rates for estimating /i or 
M may deteriorate in the second stage. 

Clearly, it is natural to use a preliminary estimator p, with the fastest 
possible convergence rate e„ = n"^""^^/^^""'"'^^ for any nonsequential proce- 
dure. Then the two-stage estimator will lead to the best possible convergence 
rates n"^"'^^^^"^"^ and n~^/^ for estimating /x and M, respectively, among all 
sequential procedures, provided that e„ = o(n~^^^^). This condition holds 
when 2a+d ^ 2a' equivalently, a > 1 + ^/l + d/2. Indeed, under this con- 
dition, the two-stage procedure achieves the optimal rates n"^""^^/*-^"^ and 
n"^/^ for estimating fi and M, respectively, even when a rate-optimal es- 
timator is not used, as long as the convergence rate e„ of the preliminary 
estimator is faster than n"^/^^"^ If the condition e„ = o(n~^/(^")) fails, the 
two-stage procedure does not give the optimal rate. In Section 4, we dis- 
cuss a multi-stage generalization that can achieve optimal rate starting with 
almost any first-stage estimator. 

The following corollary summarizes our conclusions. 

Corollary 1. Suppose that a > 1 + + d/2 and conditions (Al), 
(A2) hold. If the convergence rate of the preliminary estimator is faster 
than and the localization parameter is 6n = n~^/^'^°'\ then 

Wfi - = Op(n-("-^)/2a)^ M-M = Op(n-i/2). 

Interestingly, dimension d, which affects the first-stage optimal conver- 
gence rates n~^"~^^^^'^'^'^'^'^ and 7^~°/(2Q!+d) £qj. estimating /x and M, respec- 
tively, does not affect the corresponding two-stage and fully sequential op- 
timal convergence rates n~^"~^^/^^"^ and Thus the curse of dimen- 
sionality is nearly avoided by the two-stage procedure, provided that the 
regression function is sufficiently smooth to ensure a > 1 + + d/2. The 
lower bound in this inequality increases with the dimension d. Notice that if 
a > 3, the corollary yields the optimal rates for estimating /x and M for all 
the dimensions for which 3 > 1 -|- + d/2, that is, up to dimension d = 6, 
including the most important dimensions d= 1,2,3. 

Remark 1. We can formulate a uniform version of Theorem 1. By in- 
specting the proofs, we see that all the bounds for the two-stage procedure 
can be made uniform over the Holder class 7^^(0,^,1?) if we additionally 
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require relation (6) for some k, Xq > 0, the uniform boundedness of all the 
partial derivatives involved in the definition of Tid and the uniformity of the 
first stage estimator. 

To be more specific, for some positive a, L, Li, k, Aq, ki, 5 and e, such that 
Q > 2 and ki < k, and a compact convex D C R*^, introduce the following 
conditions: 

(Al) f end{a,L,D^) and supxg£,. |D7(x)| < Li for all i e ][(r„, d). 

~ o 

(A2) There is a unique point fx G D that maximizes the function / on 
D, supxg£,/(x) = max^^^/(x) = f{fi). Moreover, /(/x) > /(x) + 6 for all 

X ^ B{fl, Ki) and SUp^gBC/x.^) Amax(i^/(x)) < -Aq. 

Let Tid be the class of functions which satisfy (Al) and (A2). Then Theo- 
rem 1 holds uniformly in / G "H, provided ||/i — = Op{6n) holds uniformly 
over H. Condition (Al) is a strengthened version of (Al), namely (Al) is 
complemented by the requirement of uniform boundedness of all the rele- 
vant partial derivatives. Condition (A2) is in turn a stronger version of (A2): 
relation (6) is included in (A2) with common k and A for the whole class, 
and the existence of a unique location fj, of maximum is strenthened by the 
requirement of the uniform separation of the maximum function value f{n) 
from the function values outside B{n, ki). Inside this vicinity, as ni < k, the 
separation of the maximum can be characterized by the Taylor expansion 
and (6); see the arguments in (32) below. This uniform separation condition 
is essential to make the first stage rate for fi uniform over the functional 
class. Note that the separation condition for any particular function holds by 
the compactness of D and the uniqueness of the location of the maximum. 

On the other hand, the two-stage procedure can achieve improved rates 
only under a local Holder condition satisfied in a neighborhood of fi pro- 
vided that a first stage estimator with sufficiently good rate is available as a 
preliminary estimator. This is due to the fact that the second stage design 
points are chosen close to the preliminary estimate, and hence close to the 
true maximum location fi. 

Remark 2. Almost sure convergence of fi and M can be obtained as- 
suming that the preliminary estimator convergence rate is given in the al- 
most sure sense. This will follow from the estimates given in Lemmas 2 and 3. 
Under additional moment conditions on the error distribution, almost sure 
convergence rate of a kernel-type estimator can be found. 

4. Multi-stage procedures and resolving the curse of dimensionality. The- 
orem 1 shows that for estimating the maxima /x and maximum value M of 
a Holder a-smooth function /, a > 2, starting with an estimator p, having 
convergence rate e„ and localization parameter (5„ , a two-stage estimator has 
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an improved rate of convergence, unless e„ is already equal to the optimal 
rate n"^'^"^^/^^"^ More precisely, assuming that e„ = 0{n~'^) for some 7 > 
and choosing 5„ = TQ.a.'x.{mn£rii'n~^^^'^°^^), where m„ —t- 00 is a slowly varying 
sequence, the convergence rate for estimating improves to the optimal rate 
^-{a-i)/{2a) ^£ _ ^^j^-i/{2a)^ up to a slowly varying factor, if 

En ^ n~^/(^"). Although the latter rate is not optimal, further improvement 
in rate can be achieved by applying the two-stage technique again, using 
the estimator obtained in the second stage as the new preliminary estima- 
tor, and repeating the procedure until the optimal rate is obtained. After 
k iterations of the two-stage procedure, the convergence rate thus becomes 
e"n up to a slowly varying factor, provided that ^ 7),-("-i)/{2a) , 

Let ki be the largest integer such that the last relation holds for k = ki. 
Then iterating the two-stage procedure ki + 1 times, the resulting estimator 
will have the optimal convergence rate n~(°~^)/(^") . Thus the final multi- 
stage procedure has convergence rate completely free of the dimension d and 
applies to any smoothness level a > 2. In order to apply the procedure in 
ki + 1 stages, one will need to split the observation budget in ki + 1 parts 
following the description given in step 3 of the procedure. 

If we are interested only in estimating the maximum M, we may be able 
to stop earlier when applying the multi-stage procedure. In this case, the 
target optimal rate is n~^/^. The two-stage estimator has convergence rate 
given by max{e", n~^/^}. Hence the optimal rate will be obtained at stage 
k2 + 1, where k2 is the largest integer integer such that e"'' 3> n~^/^. 

Remark 3. The smoothness level a needs to be strictly greater than 2 to 
control the error in the second-order Taylor approximation of the underlying 
multivariate regression function. As a gets closer to 2, the required number 
of stages in the multi-stage procedure increases without bound. 

Consider now the adaptive version of our original estimation problem, 
where the problem is to estimate /x at the optimal rate and 
M at rate without knowing the smoothness level a. Since the choice 

of the localization parameter 5n depends on the knowledge of a it is not 
possible to apply the two-stage procedure, and hence a multi-stage adaptive 
estimator for /i is not possible. However, for estimating M, it is possible 
to construct a multi-stage procedure with convergence rate without 
knowing a, as long as a > 2. Start with an estimator for ^ which con- 
verges at rate for all Holder 2-smooth functions. For instance, the rate 
possible in dimension d by the results of Miiller (1989) and Facer 
and Miiller (2003). Then by applying Theorem 1 with J„ = m„n~'^, where 
m„ —7- 00 is a slowly varying sequence, the rate of convergence for estimating 
M improves to max{ m^n~^'^, n~^/^} in stage two. Repeating the two-stage 
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Fig. 1. Top left: true regression function (surface) and stage one observations (gray 
points). Top right: surface fitted through stage one observations, the initial estimator 
(red point) and the stage two observations (gray points). Bottom: quadratic surface fit- 
ted through the stage two observations and final estimator (green point). 

procedure k times, thus the rate will improve to n"^/^, whenever 2^/3 > |, or 
k > (log(l//3)/log2) — 1. In particular, starting with the one-stage optimal 
estimator having convergence rate n~^^^'^~^^\ the required number of stages 
to achieve n~^/^ rate at all Holder 2-smooth functions is the smallest integer 
greater than (log(4 + (i)/log2) — 1, since repeating the two-stage procedure 
beyond k given above does not hurt the rate. 

5. Simulations. In this section, we compare the performance of the two- 
stage procedure with an equivalent single-stage procedure. We consider the 
bivariate case d = 2 and take a regression function / : [0, 1]^ ^ M defined by 

/(x, y) = 5x{x — l)y{y — 1) sin(llx) sin(lli/) 

(the smooth surface in the top left panel of Figure 1). In the first stage the 
function is observed with Gaussian noise with standard deviation a = 0.1 
on a regular 25 by 25 grid (gray points in the same panel). Using standard 
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local linear regression, a surface is fit through these points (the surface in 
the top right panel of Figure 1) and the point where this fitted function 
is maximal serves as the stage one estimator /i (the red point in the same 
panel). Next we take 6 = 0.1 and generate 70 new observations at each of the 
nine points (/ii + ji6,fl2 + J2^)i ji,j2 = 0, ±1 (gray points in the top right 
panel of Figure 1). Finally a quadratic surface is fitted through these new 
data points (the surface in the lower panel of Figure 1) and the location of 
the maximum is the final second stage estimator p, (the green point in the 
figure). The implementation of the procedure is rather straightforward. In 
the statistical language R, we used the standard function loess in the first 
stage to fit the surface using the first stage observations, and we used the 
function Im to fit the quadratic surface using the second stage observations. 

Note that in total we have used 25 x 25 + 9 x 70 = 1255 observations. It is 
illustrative to compare our procedure to a single stage estimator that uses 
about the same amount of regularly spaced observations. The closest is a 
regular 36 by 36 grid, which contains 1296 points. We make noisy obser- 
vations of the function / at these grid points, again corrupted by centered 
Gaussian noise with standard deviation 0.1. We consider the estimator for 
the location of the maximum of / that is obtained by fitting a locally linear 
surface through these data points and computing the location where this is 
maximal. Obviously, the quality of this estimator depends on the bandwidth 
that is used (or span parameter, as it is called in the R function loess). To 
obtain a fair comparison with our two-stage estimator, we should make an 
optimal choice. We achieve this by repeating the experiment a large number 
of times with different bandwidths and computing numerical mean squared 
errors (MSEs). The result is shown in the left panel of Figure 2. The numer- 
ical MSE is minimal for the bandwidth choice h = 0.085. 
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h 

Fig. 2. Left: MSE of the single-stage estimator against the bandwidth used. Right: boxplot 
of the errors of the single-stage estimator (with optimal bandwidth choice /i = 0.085j and 
the errors of our two-stage procedure (with the same bandwidth choice and 5 — 0.1). 
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Fig. 3. MSB of the two-stage estimator against the bandwidth h used in the first stage 
(left), and against the localization parameter 5 (right). The dashed line is the MSB of the 
single-stage estimator with optimal bandwidth choice. 



To compare the mean-squared error of the single-stage estimator based on 
this regular grid, we replicated the experiment 10,000 times and computed 
the Monte-Carlo average of the squared difference between the estimate and 
the true maximum. The results are shown in the left boxplot in the right 
panel of Figure 2. Similarly we carried out the two-stage procedure 10,000 
times (with bandwidth h = 0.085 in stage one and 6 = 0.1) and computed 
the errors as well. These are shown in the right boxplot in the right panel 
of Figure 2. It is clear that the two-stage estimator performs better in this 
situation, in terms of the mean-squared error. This is in spite of the fact 
that the two-stage estimator has used less observations, namely 1255 in 
total compared to 1296 used by the single-stage estimator. 

In practice the quality of our procedure clearly depends on the quality 
of the estimator that is used in the first stage and also on the choice of 
the localization parameter 6. In this simulation example, where we use local 
linear regression in the first stage, the quality of the estimator therefore 
depends on the bandwidth used in stage one. To investigate the dependence 
of the performance on this parameter we carried out the simulation study 
described above for a range of bandwidths. The results are shown in the 
left panel of Figure 3. The solid line gives the MSE of our estimator as a 
function of the bandwidth used in stage one. The dashed line is the MSE of 
the optimal single stage estimator described above. The plot shows that in 
fact for a range of bandwidths the two stage procedure performs better than 
the single stage procedure. Similarly, the right panel of Figure 3 describes 
the performance of the two-stage procedure as a function of the localization 
parameter 5. Again there is a range of possible J's for which we obtain an 
improved performance, but choosing 6 too small or too large deteriorates 
the quality. In practice one might, for instance, use cross-validation type 
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methods to set the tuning parameters h and 5. Further research is needed 
to find theoreticaUy sound methods. 

6. Proofs. Throughout this section, a and d are are kept fixed. To sim- 
pHfy notation, we abbreviate I{ra,d) by I, Ik{d) by 1^ and q{a,d) by q. 

First we introduce several quantities we are going to use in the sequel. 
Define Zfc = Xfc — /i, k = 1, . . . ,n2, and reformulate definition (8) by repre- 
senting the involved quantities in terms of the newly defined shifted design 
points Zfc, /c = 1, . . . , n2. Let dj = dj — fi, j = 1, . . . , (2/ + 1)'^. Then for all 

/C = 1, ... ,712, 

(11) Zfc G {0, ±5n, . . . , ±l6nY = {dl, . . . , d(2i+i)4 C C{l5n), 

so that each of the distinct {21 + 1)*^ points are repeated = n2/{2l + 1)*^ 
times in the new design set {zk,k = 1, ... ,712}. Using definition (7), define 
an estimator 6 by equating the two polynomials 

(12) /^(x - A) = /^(x) equivalents 9 = (Z^Z)-^Z^Y, 
where 

(13) Z=(zi,...,z„,)^, Zk = {z[:i£lf 

and Zfc = Xfc — /X, k = 1,. . . ,n2- The matrix Z"^Z is invertible by Lemma 1 
below. We thus obtain an equivalent description of the estimator (/i, M) 
given by (8) in terms of the polynomial fg{z) defined by (7), with 9 defined 
by (12), 

(14) fi = il + fl, M = f'g{^i) where /i = arg max /^(z), 

zec(w„) 

C{l5n) = [-/5n, l^nf C R'^ . In doing this shifting trick, we make the compu- 
tations easier because the matrix Z^Z will have a lot of zero entries as the 
design points z^'s are symmetrically centered around zero in each dimension 
rather than being centered around fx. 

Next, let the vector = (0; : i G I)^ be defined by the equality of the two 
polynomials /^(x — jl) = Pj^^(x), where Pj^^ is the Taylor expansion of / 
of order around /x defined by (5), 

(15) Y,(^,{^-~^r=f{^^)+ E ^^(x-/^)S 

igl i6l,|i|>2 

here we have used the condition V/(/x) = 0, due to (Al) and (A2). Thus, 9 
is a random vector depending on /, /i and fi. From (15) it follows that 

(16) iWi = D'P;,^(/i), D7(/x) = D%(/x-/i), iGl. 

The next lemma ensures that the estimator (12) is well defined; that is, 
Z^Z is invertible. 
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Lemma 1. The columns of matrix Z (andX.) defined by (13) are linearly 
independent. 

Proof. Consider the matrix Z; the same proof apphes to X. 

For multi-indices i G and j € N*, define the concatenation operation 
(i,j) = (ii,...,ip,ji,...,j^) eNP+^ In particular, for /c G N, i G N^, (fe,i) = 
(/c, ii, . . . , ip). Introduce the following notation: for Z = 0, 1, . . . , d — 1 and z = 
(zi, . . . , Zrf) G define i_; = (i^+i, . . . , i^) G N"^"', z_i = {zi+i, . . . , z^) G M"^"^ 
and the set . . . , i«) = {i G N'^"^ (ii, . . . , i/, i) Gl}. 

Let Ci, i G I, be the columns of the matrix Z. We need to show that 
AiCi = implies that Ai = for all i G I. The equality ^'S^i = is 
equivalent to 

= ^AiZfc, A; = 1,2,..., 712- 

iei 

Among {zi, . . . ,z.„2}, only {21 + l)'' are distinct — {di, . . . ,d(2/+i)<i} given 
by (11). Thus, for all zG {di, . . . , d(2;+i)d}, 

iel n=0 i_iGl-i(ii) 

For a fixed z_i = (z2, . . . , -z^), the right-hand side of the last relation is a 
polynomial of order in variable zi. But we have 21 + 1 > r^ different 
design values : j = 0, ±1, ±2, . . . , it/} of the variable zi for which this 
polynomial must take the zero value. This forces all the coefficients of this 
polynomial to be zero. Thus we have that 

= ^ Xi^i_^z'-^ , ii = 0, 1, . . . , r„ 

i-iGl-i(n) 

for all possible design values of z_i = (z2, • ■ • , Zd)- Iterating the above rea- 
soning up to the variable Zd leads to, for all ii,. ■ ■ , id~i = 0, 1, . . . , r^, Zd G 
{0, ±5n, ±2(5„, . . . , ±Wn}, 

id&-(d-i){h,i2,---,id-i) 

from which we derive that Ai = for all i G I. □ 

Remark 4. In the case (i = 1, X and Z are Vandermonde matrices. 

The next lemma shows that the second stage data can be regarded as 
coming approximately from a certain polynomial regression model. 

Lemma 2. Assume (Al) and let the data {(x^, Y^), /c = 1, . . . , 722} and 
^ = (^1, . . . j.^raj)'^ be given by the second stage observation scheme, Y = 
(Yi,...,y„2)'^, rj = {r]i,...,r]n2f with r]k = /{y.^) - Pf^fj,{jiik) . Then Y = 
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Z9 + T] + where Z and 9 are defined by (13) and (15), respectively, and 
rj is independent of ^. Moreover, for some constants Ci,C2 and uniformly 
in k e {1,2, .. . ,n2}, 

(17) \r,k\<Ci6^ + C2\\p.-tir. 

Proof. Since % = /(xfc) - P/,^(xfc), by (13) and (15), 
Yk = /(xfc) + Cfc = -P/,M(xfc) +r]k + (,k 

= fei^k - P-) + m+Ck = zl9 + r]k + Ck- 

Clearly, rj is independent of ^ by definition. It remains to show (17). Apply 
the Cr-inequality, \a + 6|'' < max(l, 2''~^)(|a|'' + |6|''), r > 0, (4) and the fact 
that ||xfc — /i|| < ^/dl6n, k = 1,. . . ,n2, to obtain (17) 

\Vk\ = |/(xfc) - P/,^(xfc)| < L||xfc - 

< Lc«(iixfe - Air + iiA - /^r ) < Ci'^n + C2\\^l - /xir. □ 

Lemma 1 ensures that the matrix Z^Z is nonsingular. The following 
lemma describes the asymptotic behavior of the elements of its inverse. For 
notational convenience, below we enumerate the rows and columns of ma- 
trices starting from 0. 

Enumerate I by arranging their elements in the order described in Sec- 
tion 2, which we denote by io, . ■ . , ig, respectively. 



Lemma 3. The {i,j)th element hij of (Zi^Z) ^ satisfies 

,q- 



Proof. Since z° = 1 for all /c = 1, . . . , 77-2, we have 



Z^Z: 



?^2 



k=l 

712 "2 



fc=i 

n2 



\ 



k=l 

712 



k=l 
"2 



i=l 



\k=l 



i=l 



i=l 



Then, for some constants Oij, i,j = 0,1,.. 
matrix Z^Z as follows: 



,(7, we rewrite the symmetric 



rli^l + li?! 
'-'■gg'-'n 
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Some entries are easy to compute. For example, aoo = (2/ + 1)*^ since 712 = 
{21 + l)'^n3. Moreover, there are many zeros due to the symmetry of the 
design. In particular, Ojj = for all i,j € {0, 1, . . . , g} such that + |ij| is 
an odd number. However we are not concerned about the exact values aij 
but only about nonsingularity of the matrix A with (i, j)th entry equal to 
aij, i,j = 0,...,q. 

Let A be the diagonal matrix with elements 5n^ ' , j = 0, 1, . . . , g, in that or- 
der. Now notice that Z^Z = 723 AAA. Since Z"^Z is nonsingular by Lemma 1, 
it follows that A is also invertible. Therefore (Z^Z)~^ = n^^ . 
Denote by a*-' the (i,j)th entry of the constant matrix A~^ and recall that 
ns > cn. Then for i,j = 0,l,...,q, 

Remark 5. For d = l and even , we have 21 -\- 1 — Tq + I. Put 69 — 
+ 1, 6m = for all odd m £ {1, . . . , 2ra}, and for each even m G {1, . . . , 2ra} 

6m = 2(1 + 2"^ + 3™ + • • • + r) = 2{1 + 2"^ + • • • + (r,/2)'"}. 

Then the entries of A can be computed as follows: Since n2 = {21 + l)n3, 
Yll—i — ^ each odd m G {1, . . . , 2ra} and 

^z^ = 2n3{/-C + {1- + ■■■ + €} = n,5^b^ 

k=l 

for each even m G {1, . . . , 2ra}, we obtain that aij = bi+j. 

The case of odd can be treated similarly leading to slightly different 
constants. 

Lemma 4. Assume (Al), and let 6 and be defined by (12) and (15), 
respectively. Then 

e, = e, + Op{n~^'H~\'\) + o(5r ''') + o{\\fi - /^rj-i'i), i g i. 

Proof. Using (12) and Lemma 2, write 

(18) e-e = (z^z)~^z^Y -e = {z^zy^z^{rf + Ci- 

Since E(^) = and Cov((Z^Z)-iZ^^) = ^^(Z^Z)-^ the order of the term 
(Z^Z)~^Z"^^ is determined by the diagonal entries of the matrix (Z"^Z)~^. 
Hence, by Lemma 3, we have 

(19) (Z^Z)-IZ^^ = {0,{n-^l%0,{n-yH-\'^\), ©.(n'^^^-l^^l))^. 
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In view of (11), G C(/(5„), so that |z),| < C(5n' , k = 1,. . . , n2, i G I. Using 
this, (17), n2<n and Lemma 3, we obtain that 



(Z^Z)-^Z'^r7 



^00 ^'^k^k^ \- hoq ^ z)!^ 



k=l 

n2 



k=l 

n2 



k=l 



n2 



i=k 



n2 



(20) 



KO Yl "^kVk + --- + hqqY ^'kVk 



k=l 



k=l 



( o(5-)+o(iiA-Mr) \ 
o(5r'''')+o(ii/i-Mr'^n''''; 



Vo(5r"^')+o(ii/i-Mr<^n"^')/ 

Combining relations (18), (19) and (20) completes the proof of the lemma. 
□ 



Let Ij, J = 1, . . . ,d, be the d standard unit vectors of M*^, that is, Ij has 
1 at the jth coordinate and zeros at other d — \ coordinates. Notice that 
Ii = {lj:j = l,...,4. 

Lemma 5. Assume (Al) and let f^, 6 and be defined by (7), (12) and 
(15), respectively. If — = Op{6n), then 

V/^(/x - /i) - Vfe{^l - A) = Op{n~'/^6-') + Op{6^-'). 
Proof. The jth coordinate of the vector V/0(x) is 

^ dxi 



■> iei 



+ E 



iel: ij>l,|i|>2 

where Ij G Ii. Then, for each j = 1, . . . ,d, 



(21) 



dfgifJ'-tJ-) _ dfe{n-ii) 

dxj dxj 

iGl:ij>l,|i|>2 
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Now we bound the right-hand side of (21). Since Ij G Ii, that is, |lj| = 1 
for j = 1, . . . ,d, and — fj,\\ = Op{5n), we obtain by Lemma 4 that 

^ ^ e,^-ei^=Op{n-^/X') + o{d:-') + o{\\p,-t,r6-') 

(22) 

= Op(n-V2j-i) ^ Op(5r i = 1, . ■ • , 

The same argument apphes to each term of the sum in the right-hand side 
of (21): for all i G I such that ij > 1 and |i| > 2 

i(^i-^i)(M-Ar^^i 

< |^i-6'i|||/i-/i||l'-^^l 

(23) = [0,{n-y'6-\'\) + 0(<5r I'l) + 0(||/i - urs-^'^t^ - A^l-^ 

= op(n-i/2^-i) + op(<5r 1) + opdi/i - /xr<^-^) 

= Op(n-i/25-i) + Op(5ri). 

There are fixed number of terms in the sum from (21) and the constant ij is 
at most Tq. Combining this with (22) and (23), we see that the main term 
in (21) is 6i_ —Ox. and therefore 

||V4(/i - A) - V/e(/x -iL)\\ = Op{n~'/^6~') + Op{6:-'). □ 

For an (s x p)-matrix A, let ||A|| = sup^gigP; |ix||<i 11-^^11 t>6 the operator 
norm for the rest of this section and define the maximum norm || A||niax = 
I, where the entries of the matrix A. These norms are 

related by 

(24) ||A||max < ||A|| < ^||A||max- 

Lemma 6. Assume (Al), (A2), — =Op(5„) and -^/nJ^— )-oo. For 
fjL* G such that ||/x*|| = Op(l) and for any fixed e G (0, 1), let 

(25) B„ = {||i//(/.) - Hf^^{n*)\\ < (1 - e)\\{Hf{n)r^r'}. 
Then P(i?n) —^1 as oo, on the event Bn, {Hf^{fi*))~^ exists and 

\\{Hf{^^)r'-{Hf^{^,*)r'\\ = o,{l). 

Proof. Clearly, by the smoothness of a polynomial, 

(26) Hfe{z) = Hfe{0) + O{\\z\\) as||z||^0. 

We note that the elements of the matrix Hfg{0) [resp., Hf^{0)] are linear 

combinations of 9i (resp., ^i), i G 12- From Lemma 4 and the conditions 
a > 2, ll/i — = Op{6n) and \/n6'^ — )• oo, we obtain that 

0i -9, = Op{n-^'H-^) + 0(<5r') + OiWfL - ^|r5-2) = Op(l), i G l2. 
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where vector 6 is defined by (15). Therefore, entry-wise 

(27) Hf^{O) = Hf0{O) + o,{l). 

By (Al), (A2) and the definition (15) of 0, Hf{fx) = Hfeifx - fi). This and 
(26) imply that entry-wise 

(28) Hf{^l) = Hfeifi -ii) = Hfg{0) + 0(||/x - All). 
Combining (26), (27) and (28) leads to the following entry-wise relation: 

Hf^{t,*) = Hf^{0) + O{\\t,*\\) 

= HfgiO) + Opil) + 0{\\t^*\\) 

= Hfitx) + 0(||/x - All) + o,(l) + 0(||/x*||) 

= F/(/x)+Op(l). 

Then WHf^in*) - ///(/x)|Uax = Op(l) and hence, by (24), 

(29) \\Hf{t,)-Hf^{tJ.*)\\=Op{l). 

Next, since (Al) and (A2) imply (6), X^UHfin)) <■■■< X^^,{Hf{fi)) < 
-Ao < 0. Hence, || (F/(/x))-i || = -{X^^^{H fifi)))'^ < Aq \ or 

(30) Ao<||(if/(/.))-i-\ 

Define the event C„ = {||i?/(/i) - Hf^{fj,*)\\ < (1 - e)Ao}. Using (30) and 
Lemma 11, we obtain that 

CnCBnC{{Hf^{^l*)y' exists}. 

In view of (29), P(C„) 1 and hence P(i?n) 1- Finally, by applying (29), 

(30) and Lemma 11 again, we get that on the event Bn 

mfif,)r' - {Hf^{f^*)r'\\ < e-'\\{Hf{f,)yY\\Hf(p) - i^/^(/^*)ii 

< e~Uo-2||F/(/x) - Hf^{fJ,*)\\ = Op{l). □ 

Remark 6. Lemma 6 would still hold if we only assumed that ||/x — All = 
Op{5n) instead of ||/x — All = Op{6n)- 

Lemma 7. Assume (Al), (A2), ||/i — All = Op{6n), \/n6^ — ?■ oo and let 
-^n = {A £ C(2Wn/3)}, where the estimator li is defined by (14)- Then 
P{An) — )• 1 as OO. 

Proof. Bound P(^^) by 

(31) P(A ^ C{2l6n/3),tx -AG C{l5n/3)) + P(// - A ^ C(W„/3)). 
The second term converges to zero by the condition ||/i — All = Op((5„). 
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For a symmetric matrix M and any x € W^, Ainin(M)||x|p < X"^Mx < 
Aniax(M)||xf . Recall that V/(At) = and (6) follow from (Al) and (A2). 
Then, for fi £ C(/i, Wn/3) and x € C{p.,l6n)\C{fi,2lSn/'i), by using Taylor's 
expansion, V/(/x) = and (6), we have 

/(x) = /(/.) + i(x - fj,fHf{fj,*){^ - ^l) 

(32) 

for some positive constant c = c{\q^I) and sufficiently large n such that 
IIa** — mII < with K > from (6). 

Next, by using (4), (15) and the c,. -inequality, 

^(Z) = P;,^(Z + A) = /(Z + + 0(||z|r ) + OdlA - /xir). 

Now we combine this with Lemma 4 and the conditions a>2, ||/x — /i|| = 
Op(5n) and y/n5'^ — )• oo to obtain that, uniformly in z G C{l5n), 

4(z) = /«(z) + Op(n-V2) + o(5-) 

(33) = /(z + A) + 0(||z|r) + 0(||/i - + Op(n-i/2) + 0(5^) 

= /(z + A) + Op(<5^). 

Recall that fi G C{l5n) by the definition (14). By (32) and (33), we see 
that the event 

{A ^ C(2W„/3), /X - A G C(W„/3)} 

= {A + A G C(/i, W„) \ C(/i, 2l6n/3),fi G C(/i, W„/3)} 
implies the event 

/(/x) - c6l > /(A + A) = /^(A) + Op(5^) 

> /^(M - A) + Op('Jn.) = /(m) + OpiSl), 

leading to 

P(A ^ C{2l6j3),fx- A G C(W„/3)) < F{c6l < o.p{5l)) ^ 
as n — 7- 00. Combined with (31), this completes the proof of the lemma. □ 

Proof of Theorem 1. By (Al) and (A2), V/(/2) = 0. According to 
the definition (15) of the polynomial fg, 



(34) 



= V/(/^) = VPf^M = Vfeifi - ii). 
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By (14), max^gc'j-;^^) /^(z) =/^(/i). If this maximum is not attained on 
the boundary of C(/5„), then V/^(/i) must be zero. Hence we have that on 
the event A„ = {/i G C(2/(^„/3)} 

(35) = V/^(A) = V/^(/x - A) + i?4(/i*)(A - (/^ - A)), 

where /i* = (/.tj, . . . , /Li^J) = A/lt + (1 - A)(/x - /x) for some A G [0, 1]. Thus 

= O(IIAII) + 0(||m - AP = Op(<5n) = Op(l). 

By Lemma 6, (i//^(/x*)) ^ exists on the event Bn defined by (25). Rela- 
tions (34) and (35) imply that on the event An H Bn 

A-/i = -(/^4(/x*))-V4(M-/i) 

(36) = -(///^(/x*))-\V4(/z - A) - Vfeifi - A)) 
= -{Hf{f,)y\Vfg{fi -ii)- V/e(/x - ii)) + 

where r„ = [(if/(/i))-i - (if4(/x*))-i](V4(/x - /i) - V fe{^i - ft)) is the 
remainder term. 

By Lemma 5 and (30), we bound the norm of the first term on the right- 
hand side of (36) as 

\\{Hf{^l)r\vf^g{^l -ii)- vfeiti - A))|| 

< Ao iV/^(/x - A) - Vfeip -fi)\\ = Opijn), 

where jn = n~^/'^6~^ + Therefore ||r„|| = Op{l)Op{'jn) = Op{'^n) on the 

event Bn by Lemmas 5 and 6. Consequently on the event An H Bn, we have 

(37) IIA - = Opin-y^n' + = Op(7n). 
For any constant p> 0, 

P(|| A - > Pin) < P({|| A - /^ll > Pin} nAnH Bn) + P(A^) + P(S^). 

The first term on the right-hand side can be made arbitrarily small by 
choosing p sufficiently large in view of (37), uniformly in n, while the other 
two terms converge to zero by Lemmas 6 and 7. This proves (9). 
It remains to prove (10). From (15) it follows that 

M = fifl) =fg{fi-fl)=J2 Oiifi - A)', 

iei 

so that, according to (14), M — M can be written as 

/0(A)-/<?(/^-A) 

(38) =^[e,i,^-e,{ti-iiy] 

iei 

= ^'io-^io+ E E ^i[A'-(M-A)']. 

iei,li|>i iei,li|>i 
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By Lemma 4, the first term in (38) is 

(39) e,,-e,,=0p{n-^'^) + 0.p{5'^). 

From (14), (9) and the conditions y/nd'^ — )• oo, a > 2, — = Op{6n), it 
follows that 

||/x|| = ll/i — < ll/i — /ill + ll/i — /i|| 

(40) 

= Op(n"V25-i) + Op(5°-i) + 0(||/. - All) = o,{6n). 
Using (40) and Lemma 4, each term in the second sum of (38) 

m - < l^'i - ^illlAll''' = Op{n-'/^) + 0^(5^), 

so that, as there are a fixed number of terms in the sum, 

(41) (^i-W' = Op(n-V2) + o^(5«). 
iei,|i|>i 

Now consider the third sum in (38). Combining Lemma 8 with (9), (40) 
and the condition \\fj, — /i|| = Op{6n), we obtain that for any i € I, |i| > 1, 

lA' - (/. - /i)i < IIA - (/X - /i)|| 5^ IIAI|i''-'ll/^ - All'-' 

fc=i 

(42) =||A-/x||J]||AI|i'i^'ll/^-Af"' 

k=l 

= o,(n-V24i|-2) + o^(5«+|i|-2). 

Since D'P^^^(x), i G I, are continuous, they are bounded over the compact 
set D, so that 0i = Op(l), i E I, in view of (16). Because of this and (42), 

(43) Yl ^i[A'-(M-A)']=Op(n-'/') + Op(5°). 

iel,|i|>2 

It remains to handle separately the terms in the third sum of (38) over 
i G Ii, that is, i E I such that |i| = 1. Due to (16) and the condition ||/2 — All = 

Op{5n), 

(44) 9i = n'Pf^^{p,) = 0{\\fx-p,\\) = Opi6n), iGl,|i| = l. 
Then (42) and (44) imply that 

iei,|i|=i 

Finally, combining the last display with (38), (39), (41) and (43) completes 
the proof of (10). □ 
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Remark 7. The above argument for estimating the parameter M = 
f{n) can be refined for the problem of estimating any mixed derivative 
D'/(/i), for i G I, |i| > 2. One can take the estimator D'/0(A) ^ind estabhsh 
in a similar way that 

D7^(A) - D7(/^) = Op(n"V2<5-|i|) + 0^(5--|i|), i g I, |i| > 2. 



APPENDIX 

Lemma 8. For any x,y G M*^ and any i G N*^ such that |i| > I, 

|i| 



|x'-y'| < ||x-y||^||x||l'l" -iiyi 



y' = {x-y)J2^' V 



^^^^^^ i|-fc||^l|fc-i 
k=i 

Proof. We prove the lemma by induction in dimension. For d=l, 

J—k„,k—l 

k=l 



and the statement follows. 

Now we handle the inductive step. Suppose the statement is true for all 
dimensions k = 1, . . . ,d — 1. We want to show that it also holds for the di- 
mension d. Without loss of generality assume that ii > 0. Recall the notation 
x_i = (x2, . . . , Xd), i-i = (z2, • • • , id) that we used in Lemma 1. We have 

i i i ii 1—1 I ?i 1—1 i 

X -y =x -a;/y_;^ +x-^'y_-^^ -y 

= x5^(x^-y^) + (x--yny':^- 
Obviously, |xi| < ||x||, ||x_i|| < ||x|| and |i| = |i_i| Using these rela- 
tions and the assumption of the inductive step, we obtain that 

|i-il 

K^^ll-y'lDl < ImrMlx-i -y-ill l|x-i||l'-l-'l|y-i| 

k=l 



\k-l 



|i|-«i 

E 

fe=l 



<l|x-y|li:iWl'"~'lMl''' 



and 

K^r - yr)y-i I < l|y-iir-M^i - yil E kir^-'l2/il'-' 

k=l 

<||x-y|| E ||x||l^l-'^||yf'-i. 
fc=lil-jl+l 

Combining the last three relations, we obtain the desired result. □ 
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Below, we consider s x s matrices and let I denote the identity matrix of 
order s. Let || A|| be some norm on the space of s x s matrices satisfying the 
multiplicative property ||AB|| < ||A||||B||. For example, the operator norm 
satisfies this property. 

Lemma 9 (Banach's lemma). Let M be a matrix with ||M|| < 1. Then 

I-M is invertihle, (I - M)~i = I + M + H and - M)-i|| < 

(l-||M||)-i. 

The proof of Banach's lemma can be found in many textbooks on func- 
tional analysis. The next two lemmas are essentially adopted from Facer and 
Miiller (2003) with some modifications. 

Lemma 10. Let V he invertihle and W he such that ||W|| < HV"-"^!!^^. 
Then V + W is invertihle and 

liv-i|| 



|V|| + ||W||)~^<||(V + W)-i||< 



1 - iiv-^w 



Proof. Since ||V"^W|| < 1 due to the condition ||W|| < ||V^i||~i, the 
matrix (1 + V-^W) is invertihle and ||(I + V-^W)"!]! < (1 - ||V-iW||)-i 
by Banach's lemma. Therefore, V + W = V(I + W) is also invertihle 
and 

ll(v + w)-i|| = 11(1 + v-iw)-V"i|| 

< ||v-i||||(i + v-iw)"^|| < ^" 



1- iiv-iwir 



Now, using II V + W|| < || V|| + ||W|| and the invertibility of V + W, we 
obtain ||(V + W)-i|| > ||V + W||-i > (||V|| + ||W||)-^ □ 

Lemma 11. Let A he invertihle and B he such that ||A — B|| < (1 — 
e)||A~-'^||~^ for Sonne eG (0,1]. Then B is invertihle and 

||B"^ - A"^|| <e~^||A~^f ||A-B||. 

Proof. Write B = A + (B — A) and apply Lemma 10 with V = A and 
W = B — A to conclude that B is invertihle and, as ||A^^(B — A)|| < 1 — e 
by the condition of the lemma, 

- i-iiA-V-A)ii - 1-a-^)^'"'"'^"'"- 

By using the last relation, we complete the proof, 
||B-i - A-i|| < ||A-i||||AB-i - III 

< ||A~^||||A-B||||B~^|| <e-^||A-^||^||A-B||. □ 
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